zomato.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51717 entries, 0 to 51716
Data columns (total 17 columns):
url                            51717 non-null object
address                        51717 non-null object
name                           51717 non-null object
online_order                   51717 non-null object
book_table                     51717 non-null object
rate                           43942 non-null object
votes                          51717 non-null int64
phone                          50509 non-null object
location                       51696 non-null object
rest_type                      51490 non-null object
dish_liked                     23639 non-null object
cuisines                       51672 non-null object
approx_cost(for two people)    51371 non-null object
reviews_list                   51717 non-null object
menu_item                      51717 non-null object
listed_in(type)                51717 non-null object
listed_in(city)                51717 non-null object
dtypes: int64(1), object(16)
memory usage: 6.7+ MB
zomato.head()
| | url | address | name | online_order | book_table | rate | votes | phone | location | rest_type | dish_liked | cuisines | approx_cost(for two people) | reviews_list | menu_item | listed_in(type) | listed_in(city) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://www.zomato.com/bangalore/jalsa-banasha... | 942, 21st Main Road, 2nd Stage, Banashankari, ... | Jalsa | Yes | Yes | 4.1/5 | 775 | 080 42297555\r\n+91 9743772233 | Banashankari | Casual Dining | Pasta, Lunch Buffet, Masala Papad, Paneer Laja... | North Indian, Mughlai, Chinese | 800 | [('Rated 4.0', 'RATED\n A beautiful place to ... | [] | Buffet | Banashankari |
| 1 | https://www.zomato.com/bangalore/spice-elephan... | 2nd Floor, 80 Feet Road, Near Big Bazaar, 6th ... | Spice Elephant | Yes | No | 4.1/5 | 787 | 080 41714161 | Banashankari | Casual Dining | Momos, Lunch Buffet, Chocolate Nirvana, Thai G... | Chinese, North Indian, Thai | 800 | [('Rated 4.0', 'RATED\n Had been here for din... | [] | Buffet | Banashankari |
| 2 | https://www.zomato.com/SanchurroBangalore?cont... | 1112, Next to KIMS Medical College, 17th Cross... | San Churro Cafe | Yes | No | 3.8/5 | 918 | +91 9663487993 | Banashankari | Cafe, Casual Dining | Churros, Cannelloni, Minestrone Soup, Hot Choc... | Cafe, Mexican, Italian | 800 | [('Rated 3.0', "RATED\n Ambience is not that ... | [] | Buffet | Banashankari |
| 3 | https://www.zomato.com/bangalore/addhuri-udupi... | 1st Floor, Annakuteera, 3rd Stage, Banashankar... | Addhuri Udupi Bhojana | No | No | 3.7/5 | 88 | +91 9620009302 | Banashankari | Quick Bites | Masala Dosa | South Indian, North Indian | 300 | [('Rated 4.0', "RATED\n Great food and proper... | [] | Buffet | Banashankari |
| 4 | https://www.zomato.com/bangalore/grand-village... | 10, 3rd Floor, Lakshmi Associates, Gandhi Baza... | Grand Village | No | No | 3.8/5 | 166 | +91 8026612447\r\n+91 9901210005 | Basavanagudi | Casual Dining | Panipuri, Gol Gappe | North Indian, Rajasthani | 600 | [('Rated 4.0', 'RATED\n Very good restaurant ... | [] | Buffet | Banashankari |
After slicing off the unwanted columns, below is the final list of columns:
# zomato.columns
set(zomato)
{'approx_cost(for two people)',
'book_table',
'cuisines',
'dish_liked',
'listed_in(city)',
'listed_in(type)',
'name',
'online_order',
'rate',
'rest_type',
'reviews_list',
'votes'}
zomato.rename(columns={'cuisines': 'Cuisine',
                       'listed_in(city)': 'Locality',
                       'listed_in(type)': 'Listed_Type',
                       'approx_cost(for two people)': 'Approx_Cost',
                       'name': 'Name',
                       'rest_type': 'Restaurant_Type',
                       'rate': 'Rating',
                       'votes': 'Total_Votes',
                       'online_order': 'Online_Order',
                       'book_table': 'Table_Booking',
                       'dish_liked': 'Dishes_Liked'},
              inplace = True)
zomato.columns
Index(['Name', 'Online_Order', 'Table_Booking', 'Rating', 'Total_Votes',
'Restaurant_Type', 'Dishes_Liked', 'Cuisine', 'Approx_Cost',
'reviews_list', 'Listed_Type', 'Locality', 'review_rates'],
dtype='object')
# Let's drop the rows whose rating is NaN and whose review rate is 0:
zomato.drop(zomato[(zomato['review_rates'] == 0) & (zomato['Rating'].isna())].index, inplace=True)
# Replace NaN and '-' ratings with the review-rate values:
zomato['Rating'] = np.where(zomato.Rating.isna(), zomato.review_rates, zomato.Rating)
zomato['Rating'] = np.where(zomato['Rating'] == '-', zomato.review_rates, zomato.Rating)
# After dropping NaNs
zomato.head()
| | Name | Online_Order | Table_Booking | Rating | Total_Votes | Restaurant_Type | Dishes_Liked | Cuisine | Approx_Cost | reviews_list | Listed_Type | Locality | review_rates |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jalsa | Yes | Yes | 4.1 | 775 | Casual Dining | Pasta, Lunch Buffet, Masala Papad, Paneer Laja... | North Indian, Mughlai, Chinese | 800 | [('Rated 4.0', 'RATED\n A beautiful place to ... | Buffet | Banashankari | 4.08 |
| 1 | Spice Elephant | Yes | No | 4.1 | 787 | Casual Dining | Momos, Lunch Buffet, Chocolate Nirvana, Thai G... | Chinese, North Indian, Thai | 800 | [('Rated 4.0', 'RATED\n Had been here for din... | Buffet | Banashankari | 3.57 |
| 2 | San Churro Cafe | Yes | No | 3.8 | 918 | Cafe, Casual Dining | Churros, Cannelloni, Minestrone Soup, Hot Choc... | Cafe, Mexican, Italian | 800 | [('Rated 3.0', "RATED\n Ambience is not that ... | Buffet | Banashankari | 3.15 |
| 3 | Addhuri Udupi Bhojana | No | No | 3.7 | 88 | Quick Bites | Masala Dosa | South Indian, North Indian | 300 | [('Rated 4.0', "RATED\n Great food and proper... | Buffet | Banashankari | 3.51 |
| 4 | Grand Village | No | No | 3.8 | 166 | Casual Dining | Panipuri, Gol Gappe | North Indian, Rajasthani | 600 | [('Rated 4.0', 'RATED\n Very good restaurant ... | Buffet | Banashankari | 4.00 |
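The raw `rate` column holds strings such as `4.1/5`, while the table above already shows plain numbers, so some parsing happened in a cell that is not shown. A minimal sketch of one way to do it, on toy data; the helper name `clean_rating` and the toy values are assumptions for illustration, not the notebook's actual code:

```python
import numpy as np
import pandas as pd

def clean_rating(raw: pd.Series, fallback: pd.Series) -> pd.Series:
    """Hypothetical helper: parse '4.1/5'-style strings, fall back where missing."""
    # Drop the '/5' suffix and any stray whitespace
    stripped = raw.astype(str).str.replace("/5", "", regex=False).str.strip()
    # Anything that is not a number (for example '-' or 'nan') becomes NaN
    numeric = pd.to_numeric(stripped, errors="coerce")
    # Fill the gaps from the fallback column (aligned on the index)
    return numeric.fillna(fallback)

toy = pd.DataFrame({
    "rate": ["4.1/5", "-", np.nan, "3.8 /5"],
    "review_rates": [4.08, 3.5, 2.9, 3.15],
})
toy["Rating"] = clean_rating(toy["rate"], toy["review_rates"])
print(toy)
```

Coercing first and filling afterwards handles NaN and `-` in a single pass, instead of one `np.where` call per special value.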
# A few columns still contain NaNs, so let's drop those rows
print("Approx_Cost column has {} NaN values and Cuisine column has {} NaN values, we need to drop these.".format(zomato.Approx_Cost.isna().sum(), zomato.Cuisine.isna().sum()))
Approx_Cost column has 300 NaN values and Cuisine column has 17 NaN values, we need to drop these.
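The drop itself is not shown in the notebook. A minimal sketch of how rows with a missing cost or cuisine could be removed before the cost is converted to a number, assuming toy data with the same column names:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real frame (values are made up)
toy = pd.DataFrame({
    "Approx_Cost": ["800", "1,200", np.nan, "300"],
    "Cuisine": ["North Indian", "Cafe", "Chinese", np.nan],
})

# Keep only the rows where both columns are present
cleaned = toy.dropna(subset=["Approx_Cost", "Cuisine"]).copy()

# Remove thousands separators, then convert the cost to a number
cleaned["Approx_Cost"] = pd.to_numeric(
    cleaned["Approx_Cost"].str.replace(",", "", regex=False)
)
print(cleaned)
```

Restricting `dropna` to a `subset` matters here: `Dishes_Liked` is missing for roughly half of the restaurants, so a plain `dropna()` would discard far more rows than intended.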
# print(type(zomato['Approx_Cost'][0]))
zomato['Rating'] = pd.to_numeric(zomato['Rating'])
# Remove thousands separators (e.g. '1,200') before converting the cost to a number
zomato['Approx_Cost'] = zomato['Approx_Cost'].replace(',', '', regex=True)
zomato['Approx_Cost'] = pd.to_numeric(zomato['Approx_Cost'])
# Strip non-ASCII characters from the restaurant names
zomato['Name'] = zomato['Name'].replace({r'[^\x00-\x7F]+': ''}, regex = True)
# type(zomato['Approx_Cost'][0])
zomato.head()
| | Name | Online_Order | Table_Booking | Rating | Total_Votes | Restaurant_Type | Dishes_Liked | Cuisine | Approx_Cost | Listed_Type | Locality |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Jalsa | Yes | Yes | 4.1 | 775 | Casual Dining | Pasta, Lunch Buffet, Masala Papad, Paneer Laja... | North Indian, Mughlai, Chinese | 800 | Buffet | Banashankari |
| 1 | Spice Elephant | Yes | No | 4.1 | 787 | Casual Dining | Momos, Lunch Buffet, Chocolate Nirvana, Thai G... | Chinese, North Indian, Thai | 800 | Buffet | Banashankari |
| 2 | San Churro Cafe | Yes | No | 3.8 | 918 | Cafe, Casual Dining | Churros, Cannelloni, Minestrone Soup, Hot Choc... | Cafe, Mexican, Italian | 800 | Buffet | Banashankari |
| 3 | Addhuri Udupi Bhojana | No | No | 3.7 | 88 | Quick Bites | Masala Dosa | South Indian, North Indian | 300 | Buffet | Banashankari |
| 4 | Grand Village | No | No | 3.8 | 166 | Casual Dining | Panipuri, Gol Gappe | North Indian, Rajasthani | 600 | Buffet | Banashankari |
zomato.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 43922 entries, 0 to 51716
Data columns (total 11 columns):
Name               43922 non-null object
Online_Order       43922 non-null object
Table_Booking      43922 non-null object
Rating             43922 non-null float64
Total_Votes        43922 non-null int64
Restaurant_Type    43766 non-null object
Dishes_Liked       23353 non-null object
Cuisine            43922 non-null object
Approx_Cost        43922 non-null int64
Listed_Type        43922 non-null object
Locality           43922 non-null object
dtypes: float64(1), int64(2), object(8)
memory usage: 4.0+ MB
fig1, ax1 = plt.subplots(figsize=(6,5));
# Pie chart of the share of restaurants that accept online orders
ax1.pie(zomato.Online_Order.value_counts().values, autopct = '%1.1f%%', shadow=True);
ax1.legend(zomato.Online_Order.value_counts().index,
           loc="lower right")
ax1.axis('equal');
ax1.set_title('Share of restaurants with online ordering');
We see that almost 65% of the restaurants offer online ordering, probably because it is one of their major sources of income.
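The percentage read off the pie chart can also be computed directly. A small sketch on toy data (the values below are made up; on the real frame the same call would be applied to `zomato.Online_Order`):

```python
import pandas as pd

# Toy stand-in for zomato.Online_Order
orders = pd.Series(["Yes", "Yes", "No", "Yes", "No"], name="Online_Order")

# normalize=True returns fractions instead of raw counts
share = orders.value_counts(normalize=True).mul(100).round(1)
print(share)
```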
zomato.Table_Booking.value_counts().plot.bar();
plt.title('Number of restaurants with online table booking');
plt.xticks(rotation = 0);
So most restaurants don't offer table booking: only about 6.3k of the roughly 44k restaurants do. Let's look at what kind of restaurants these are.
plt.figure(figsize=(10,6))
# Top 20 restaurant types among the restaurants that offer table booking
booking_counts = zomato[zomato['Table_Booking'] == 'Yes'].Restaurant_Type.value_counts(ascending = True).tail(20)
ax = booking_counts.plot.barh(y = 'Restaurant_Type');
plt.xlabel('Number of Restaurants')
plt.title('Distribution of Types of Restaurants with Table Booking option');
# Label each bar with its percentage share of the top 20 types
ylab = booking_counts.values
for i, v in enumerate(ylab):
    ax.text(v+50, i-0.25, np.around(v/sum(ylab)*100,2), color='black', fontsize = 10)
We can see from the graph that more than 55% of the restaurants that offer table booking are of the Casual Dining type and 9% are Cafes.
Let's check the average rating of restaurants with and without table booking:
print(zomato[zomato['Table_Booking'] == 'No'].Rating.describe())
print("The average rating of restaurants without table booking is 3.6, and half of them are rated 3.7/5 or higher. This group is also much larger. Is the cost also lower at such restaurants?")
count    37602.000000
mean         3.596383
std          0.561743
min          0.000000
25%          3.300000
50%          3.700000
75%          3.900000
max          5.000000
Name: Rating, dtype: float64
The average rating of restaurants without table booking is 3.6, and half of them are rated 3.7/5 or higher. This group is also much larger. Is the cost also lower at such restaurants?
print(zomato[zomato['Table_Booking'] == 'Yes'].Rating.describe())
print("About 6.3k restaurants offer table booking. They have an average rating of 4.1, and half of them are rated 4.2/5 or higher, which suggests that restaurants with table booking receive better ratings. Is that truly the case?")
count    6320.000000
mean        4.142900
std         0.302807
min         2.200000
25%         4.000000
50%         4.200000
75%         4.300000
max         5.000000
Name: Rating, dtype: float64
About 6.3k restaurants offer table booking. They have an average rating of 4.1, and half of them are rated 4.2/5 or higher, which suggests that restaurants with table booking receive better ratings. Is that truly the case?
print("The minimum cost at restaurants with table booking is Rs.{} while the minimum cost at restaurants without table booking is Rs.{}".format(zomato[zomato['Table_Booking'] == 'Yes'].Approx_Cost.min(), zomato[zomato['Table_Booking'] == 'No'].Approx_Cost.min()))
The minimum cost at restaurants with table booking is Rs.300 while the minimum cost at restaurants without table booking is Rs.40
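The separate `describe()` and `min()` calls above can be collapsed into one grouped summary. A minimal sketch, assuming toy data with the same column names (the values are made up, and the output column names such as `mean_rating` are chosen here for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Table_Booking": ["Yes", "Yes", "No", "No", "No"],
    "Rating": [4.2, 4.0, 3.5, 3.7, 3.9],
    "Approx_Cost": [1200, 800, 300, 40, 500],
})

# One row per group, one named column per statistic
summary = toy.groupby("Table_Booking").agg(
    n=("Rating", "size"),
    mean_rating=("Rating", "mean"),
    median_rating=("Rating", "median"),
    min_cost=("Approx_Cost", "min"),
    median_cost=("Approx_Cost", "median"),
)
print(summary)
```

Putting both groups side by side makes it harder to misread a quartile, and the median cost is a more robust comparison than the minimum alone.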
Do restaurants with table booking actually receive higher ratings? Let's test this hypothesis.
H0 : The mean rating of restaurants with table booking is the same as the mean rating of restaurants without it.
H1 : The mean rating of restaurants with table booking is greater than the mean rating of restaurants without it.
$H_{0}: \mu_{1} = \mu_{2}$
$H_{1}: \mu_{1} > \mu_{2}$, where $\mu_{1}$ is the mean rating with table booking and $\mu_{2}$ the mean rating without it.
Conditions for Hypothesis Testing:
- 1) The samples are random.
- 2) Each sample is less than 10% of its population.
- 3) The sampling distribution of the mean can be treated as approximately normal, since each sample has at least 30 observations.
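The cell that draws the two random samples is not shown. A minimal sketch of how such samples could be drawn while checking the conditions above, assuming a stand-in population; the helper `draw_sample`, the seeds and the synthetic ratings are all hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in population (synthetic values, not the Zomato ratings)
population = pd.DataFrame({
    "Table_Booking": ["Yes"] * 600 + ["No"] * 3400,
    "Rating": np.round(rng.uniform(2.0, 5.0, size=4000), 1),
})

def draw_sample(df, booking, n, seed):
    """Hypothetical helper: simple random sample of ratings from one group."""
    ratings = df.loc[df["Table_Booking"] == booking, "Rating"]
    # Condition 2: the sample must stay under 10% of its group
    assert n < 0.10 * len(ratings), "sample exceeds 10% of the group"
    # Condition 1: sample() draws without replacement, uniformly at random
    return ratings.sample(n=n, random_state=seed)

rand_yes = draw_sample(population, "Yes", n=30, seed=1)
rand_no = draw_sample(population, "No", n=40, seed=2)
print(rand_yes.describe())
print(rand_no.describe())
```

Fixing `random_state` keeps the samples, and therefore the test result, reproducible between runs.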
print('Statistics of a random sample of restaurants without table booking facility:')
rand_Rate_Booking_No.describe()
Statistics of a random sample of restaurants without table booking facility:
count    40.000000
mean      3.494000
std       0.557654
min       2.000000
25%       3.250000
50%       3.400000
75%       3.815000
max       5.000000
dtype: float64
print('Statistics of a random sample of restaurants with table booking facility:')
rand_Rate_Booking_Yes.describe()
Statistics of a random sample of restaurants with table booking facility:
count    30.000000
mean      4.146667
std       0.287358
min       3.700000
25%       3.925000
50%       4.150000
75%       4.375000
max       4.700000
dtype: float64
# We use ttest_ind to compute the p-value. It returns a two-sided p-value by default;
# for the one-sided H1 above, that value can be halved when t_stat > 0.
t_stat, p = ttest_ind(rand_Rate_Booking_Yes,rand_Rate_Booking_No)
print('After running the T-test we get that:')
print('test statistic is', t_stat)
print('and p-Value is', p)
After running the T-test we get that:
test statistic is 5.847388040064944
and p-Value is 1.5681416219944706e-07
if (p < 0.05):
    print("Since the p-value is less than the significance level of 0.05, we reject the null hypothesis: there is enough evidence that the mean rating of restaurants with table booking is greater than the mean rating of restaurants without it. However, we cannot claim causation, as there may be other lurking variables.")
else:
    print("Since the p-value is greater than the significance level of 0.05, we fail to reject the null hypothesis: there is not enough evidence that the mean rating of restaurants with table booking is greater than the mean rating of restaurants without it.")
Since the p-value is less than the significance level of 0.05, we reject the null hypothesis: there is enough evidence that the mean rating of restaurants with table booking is greater than the mean rating of restaurants without it. However, we cannot claim causation, as there may be other lurking variables.
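The test can be re-derived from the two `describe()` summaries alone. The sketch below uses `scipy.stats.ttest_ind_from_stats` with the printed means, standard deviations and sample sizes; the pooled (equal-variance) version should reproduce the statistic of about 5.85 reported above. Because the two sample standard deviations differ noticeably (0.29 vs 0.56), Welch's unequal-variance version is computed as well. That comparison, and the helper `one_sided_greater`, are additions for illustration and not part of the original analysis:

```python
from scipy import stats

# Summary statistics copied from the two describe() outputs above
yes = dict(mean=4.146667, std=0.287358, n=30)   # with table booking
no = dict(mean=3.494000, std=0.557654, n=40)    # without table booking

args = (yes["mean"], yes["std"], yes["n"], no["mean"], no["std"], no["n"])

# Pooled-variance t-test (the default behaviour of ttest_ind)
pooled = stats.ttest_ind_from_stats(*args, equal_var=True)
# Welch's t-test, which does not assume equal variances
welch = stats.ttest_ind_from_stats(*args, equal_var=False)

def one_sided_greater(result):
    """Convert a two-sided p-value into the p-value for H1: mu1 > mu2."""
    half = result.pvalue / 2
    return half if result.statistic > 0 else 1 - half

for label, res in [("pooled", pooled), ("welch", welch)]:
    print(label, res.statistic, res.pvalue, one_sided_greater(res))
```

Both variants lead to the same decision at the 0.05 level here; the one-sided p-value is simply half of the two-sided one because the statistic is positive.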
plt.figure(figsize=(30,10))
plt.subplot(1,2,1)
# Left panel: cost for two at restaurants that offer table booking
sns.countplot(x = zomato[zomato['Table_Booking'] == 'Yes']['Approx_Cost'])
plt.xticks(rotation = 60)
plt.title('Cost distribution of restaurants with table booking');
plt.subplot(1,2,2)
# plt.figure(figsize=(20,5))
# Right panel: cost for two at restaurants without table booking
sns.countplot(x = zomato[zomato['Table_Booking'] == 'No']['Approx_Cost'])
plt.xticks(rotation = 60)
plt.title('Cost distribution of restaurants without table booking');
Analyzing further, we see that restaurants with table booking have a higher rating on average (from the hypothesis test) and are costlier than restaurants without it: their costs start at Rs. 300, whereas the others start at Rs. 40.
From the above graph we see that there are fewer and fewer restaurants as the cost rises above Rs. 3000. Let's see what kind of restaurants these are.
print(zomato[zomato['Approx_Cost'] > 3000]['Restaurant_Type'].value_counts())
print('Restaurants costing more than Rs. 3000 are mostly Fine Dining. The average cost at these restaurants is', zomato[zomato['Approx_Cost'] > 3000].Approx_Cost.mean())
Fine Dining         56
Fine Dining, Bar    21
Lounge               2
Name: Restaurant_Type, dtype: int64
Restaurants costing more than Rs. 3000 are mostly Fine Dining. The average cost at these restaurants is 3800.0
plt.figure(figsize=(8,5))
zomato[zomato['Approx_Cost'] > 3000]['Name'].value_counts(ascending=True).plot.barh();
From the above analysis and graph we see that, as expected, Fine Dining restaurants are the costliest type, and the most popular chains around Bengaluru are Ritz Carlton, JW Marriott, Leela Palace and ITC.
plt.figure(figsize=(10, 6))
zomato.Restaurant_Type.value_counts(ascending=True).tail(20).plot.barh();
plt.title('Distribution of Restaurant types:');
From the graph above we see that Quick Bites is the dominant restaurant type in the market, followed by Casual Dining and Cafes. Let's analyze Quick Bites further to find the leading restaurants and chains.
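# Top 20 Quick Bites names by number of listings; naming the axis and the count column
# explicitly keeps the result independent of the pandas version's default column names
z1 = zomato[zomato['Restaurant_Type'] == 'Quick Bites']['Name'].value_counts().head(20).rename_axis('Restaurant_Name').reset_index(name = 'Count')
plt.figure(figsize=(10,10))
sns.barplot(y= 'Restaurant_Name', x = 'Count', data = z1);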
Surprisingly, 5-Star Chicken, a local brand, has overtaken McDonald's, KFC and Domino's Pizza around Bangalore.
# Count restaurants per service type once and reuse it for the axes and the bar labels
type_counts = zomato.Listed_Type.value_counts()
trace = go.Bar(x = type_counts.index,
               y = type_counts.values,
               text = type_counts.values)
data = [trace]
layout = go.Layout(title = 'Distribution of Restaurants by Service Type', yaxis = dict(title = 'Number of Restaurants'))
fig = go.Figure(data = data, layout= layout)
py.iplot(fig);
Delivery is the most popular service type and is offered by most of the restaurants!
plt.figure(figsize = (15,10));
# cmap = sns.cubehelix_palette(dark=.3, light=.8, as_cmap=True)
sns.scatterplot(x='Name', y='Approx_Cost', s=40, data = zomato, hue = 'Listed_Type', style = 'Listed_Type', palette='Set2');
plt.xticks([])  # hide the crowded restaurant-name tick labels
plt.title('Distribution of Approx_Cost across all Restaurants', size= 18);
Dine-out restaurants are spread more widely across the cost range, from low-cost to high-cost meals.
c1 = ' '.join([text for text in z])
plt.figure(figsize=(10,10))
wordcloud = WordCloud(background_color = 'white', collocations = False, width=1500, height=1500).generate(c1)
plt.imshow(wordcloud)
plt.axis('off')
plt.title('Most popular Dishes in Bengaluru', size = 20)
plt.show()
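The list `z` fed into the word cloud is built in a cell that is not shown. A minimal sketch of one way to turn a `Dishes_Liked`-style column into dish counts, assuming comma-separated entries as in the tables above; the toy values and variable names are made up:

```python
from collections import Counter

import numpy as np
import pandas as pd

# Toy stand-in for zomato['Dishes_Liked'] (missing for many restaurants)
dishes_liked = pd.Series([
    "Pasta, Lunch Buffet, Masala Papad",
    "Momos, Lunch Buffet",
    np.nan,
    "Masala Dosa",
    "Pasta, Masala Dosa",
])

# Split each entry on commas and keep multi-word dish names intact
tokens = [
    dish.strip()
    for entry in dishes_liked.dropna()
    for dish in entry.split(",")
]
dish_counts = Counter(tokens)
print(dish_counts.most_common(3))
```

Counting whole dish names avoids splitting "Masala Dosa" into two unrelated words, which is what happens when the strings are joined with spaces. A frequency dictionary like this can be passed to `WordCloud.generate_from_frequencies` instead of a joined string.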
c2 = ' '.join([st for st in zomato['Cuisine']])
plt.figure(figsize=(10,10))
wordcloud = WordCloud(background_color = 'white', collocations = True, width=1500, height=1500).generate(c2)
plt.imshow(wordcloud)
plt.axis('off')
plt.title('Most popular Cuisines in Bengaluru', size = 20)
plt.show()
plt.figure(figsize=(10,10))
zomato.Locality.value_counts(ascending=True).plot.barh();
Here we see that BTM is the locality with the most restaurants, almost 3000, followed by Koramangala, while New BEL Road has the fewest.
plt.figure(figsize=(10,5))
# zomato.Approx_Cost.plot.hist()
# Histogram with a KDE overlay (distplot is deprecated in recent seaborn releases)
sns.histplot(zomato.Approx_Cost, kde=True)
# plt.xticks(np.arange(0, 4500, 500))
plt.xlim(0, 6000)
plt.xlabel('Approx. Cost for two')
plt.title("The Cost distribution");
We notice that the plot is right skewed, with most costs falling between 0 and 1000 and very few going beyond 2000. What I would like to investigate further is what kind of restaurants cost more than 2000 and how their ratings compare to those of other restaurants.
Let's check the distribution of the restaurant ratings.
plt.figure(figsize=(10,5))
# zomato.Rating.plot.hist();
# Histogram with a KDE overlay (distplot is deprecated in recent seaborn releases)
sns.histplot(zomato.Rating, kde=True);
The rating distribution is left skewed, with most ratings between 3 and 4.5 and a peak around 4. Let's investigate further to see what kind of restaurants receive lower ratings and which restaurants are rated higher than average.
Let's take a look at the most popular restaurants, the ones with the largest number of branches in the city.
plt.figure(figsize=(10,6));
zomato.Name.value_counts(ascending = True).tail(20).plot.barh();
CCD seems to have the largest number of outlets in Bengaluru, followed by Onesta, Just Bake and Empire. Let's see what factors contribute to their success and popularity. Do CCD and Onesta have comparatively higher ratings or lower costs?
trace = go.Scatter(x = zomato['Rating'], y = zomato['Total_Votes'], textposition = 'top center', text = zomato['Name'], mode = 'markers')
data1 = [trace]
layout = go.Layout(title = 'Votes v/s Rating', xaxis = dict(title = 'Average Rate'), yaxis = dict(title = 'Total no. of Votes'))
fig = go.Figure(data = data1, layout = layout)
py.iplot(fig);
We see a positive correlation between the number of votes and the rating: as the number of votes increases, the rating of the restaurant also goes up. The restaurants with the most votes are rated between 4 and 5. Byg Brewski, Truffles and Absolute Barbeque have received the most votes and are among the highest rated restaurants.
Let's see the relationship between the rating and the cost of a meal at a restaurant.
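The strength of a relationship like this can be quantified rather than read off the scatter plot. A minimal sketch on toy data (the values are made up; on the real frame the same calls would use `zomato['Rating']` and `zomato['Total_Votes']`):

```python
import pandas as pd

toy = pd.DataFrame({
    "Rating": [3.1, 3.6, 3.9, 4.2, 4.6],
    "Total_Votes": [12, 80, 150, 900, 5000],
})

# Pearson measures linear association and is sensitive to a few huge vote counts
pearson = toy["Rating"].corr(toy["Total_Votes"])

# Spearman is Pearson on the ranks, so it only asks whether the relation is monotonic
spearman = toy["Rating"].rank().corr(toy["Total_Votes"].rank())

print("Pearson:", round(pearson, 3), "Spearman:", round(spearman, 3))
```

Vote counts are heavily right skewed, so the rank-based coefficient is usually the more informative of the two for this kind of plot.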
trace = go.Scatter(x = zomato['Rating'], y = zomato['Approx_Cost'], mode = 'markers', text = zomato['Name'])
data = [trace]
layout = go.Layout(title = 'Rating v/s Cost of meal at the restaurant', xaxis = dict(title = 'Rate'), yaxis = dict(title = 'Approx_Cost'))
fig = go.Figure(data = data, layout = layout)
py.iplot(fig);
Are restaurants in some localities costlier than in others?
plt.figure(figsize=(15,5))
sns.barplot(x = 'Locality', y = 'Approx_Cost', data = zly);
plt.xticks(rotation = 60);
plt.xlabel('Locality', size = 15)
plt.ylabel('Approximate Cost', size = 15)
plt.title('Average Cost of meal at restaurants by Locality', size = 15);
Here we see that Church Street, Brigade Road and Lavelle Road are among the costliest places for a meal, and Banashankari has the cheapest meals.
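The frame `zly` used in the bar plot is built in a cell that is not shown. A minimal sketch of how an average-cost-per-locality table could be produced, assuming toy data with the same column names (the values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "Locality": ["Church Street", "Church Street", "Banashankari", "Banashankari", "BTM"],
    "Approx_Cost": [1500, 900, 300, 500, 600],
})

# Mean cost for two per locality, costliest first
zly = (
    toy.groupby("Locality", as_index=False)["Approx_Cost"]
       .mean()
       .sort_values("Approx_Cost", ascending=False)
)
print(zly)
```

Sorting before plotting makes the costliest and cheapest localities easy to pick out at the two ends of the chart.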
# Mean rating per service type, highest first
zomato.groupby('Listed_Type')[['Rating']].mean().sort_values('Rating', ascending = False).plot.bar();
plt.xticks(rotation = 60);
plt.ylabel('Rating');
The pubs-and-bars and drinks categories are the highest rated types of restaurants.
sns.catplot(x = 'Online_Order', y = 'Total_Votes', kind="violin", data = zomato)
plt.title("Plot of Total votes v/s Online order", size = 15) ;
Now let's take a look at the ratings of the restaurants with the largest number of branches in the city and see if there is any relation:
fig = px.scatter(top_brands, x="Avg_Rating", y="#OfBranches", trendline="ols")
fig.show()
The trendline suggests that there is a relationship between a brand's average rating and the number of branches it has opened.
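The `top_brands` frame is built in a cell that is not shown. A minimal sketch of how it could be derived, assuming that each row of the data is one branch and using the column names from the plot call; the toy values are made up:

```python
import pandas as pd

toy = pd.DataFrame({
    "Name": ["A", "A", "A", "B", "B", "C"],
    "Rating": [4.0, 4.2, 4.4, 3.5, 3.7, 3.0],
})

# One row per brand: mean rating and number of listed branches
top_brands = (
    toy.groupby("Name")["Rating"]
       .agg(["mean", "size"])
       .rename(columns={"mean": "Avg_Rating", "size": "#OfBranches"})
       .sort_values("#OfBranches", ascending=False)
       .head(20)
       .reset_index()
)

# Correlation coefficient behind the OLS trendline
r = top_brands["Avg_Rating"].corr(top_brands["#OfBranches"])
print(top_brands)
print("Pearson r:", round(r, 3))
```

Reporting the coefficient next to the plot states how strong the relationship is, which the trendline alone does not show.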